Information Aggregation Using the Caméléon# Web Wrapper
Abstract
Caméléon# is a web data extraction and management tool that provides information aggregation with advanced capabilities that are useful for developing value-added applications and services for electronic business and electronic commerce. To illustrate its features, we use an airfare aggregation example that collects data from eight online sites, including Travelocity, Orbitz, and Expedia. This paper covers the integration of Caméléon# with commercial database management systems, such as MS SQL Server, and XML query languages, such as XQuery.
Similar resources
The Camaleon Web Wrapper Engine
The web is rapidly becoming the universal repository of information. A major challenge is the ability to support the effective flow of information among the sources and services on the web and their interconnection with legacy systems that were designed to operate with traditional relational databases. This paper describes a technology and infrastructure to address these needs, based on the des...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Site-Wide Wrapper Induction for Life Science Deep Web Databases
We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated f...
Learning Wrappers Efficiently for Web Information Extraction Using Unlabeled Examples
In this paper, we describe techniques for learning wrappers efficiently using very few user-supplied labels (typically, 1 or 2 labels, all within a single page). This is an improvement over previous work, which requires multiple labeled examples on multiple pages. In effect, it brings the power of the wrapper down to the level of the end-user, who can teach, by only a few demonstrations, the lab...
Recognizing Structure in Web Pages using Similarity Queries
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful"--i.e., a structure that was used in a han...